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Abstract 



BIG criterion is widely used by the neural-network community for 
C^ , model selection tasks, although its convergence properties are not always 

theoretically established. In this paper we will focus on estimating the 
number of components in a mixture of multilayer perceptrons and prov- 
ing the convergence of the BIG criterion in this frame. The penalized 
marginal-likelihood for mixture models and hidden Markov models intro- 
1^ , duced by Keribin (2000) and, respectively, Gassiat (2002) is extended 

to mixtures of multilayer perceptrons for which a penalized-likelihood 
criterion is proposed. We prove its convergence under some hypothesis 
which involve essentially the bracketing entropy of the generalized score- 
functions class and illustrate it by some numerical examples. 



o. 

^ ■ 1 Introduction 

Although hnear models have been the standard tool for time series analysis 
k^ ^ for a long time, their limitations have been underlined during the past twenty 

'V^ ■ years. Real data often exhibit characteristics that are not taken into account by 

C^ I linear models. Financial series, for instance, alternate strong and weak volatil- 

ity periods, while economic series are often related to the business cycle and 
switch from recession to normal periods. Several solutions such as heterosce- 
datic ARCH, GARCH models, threshold models, multilayer perceptrons or au- 
toregressive switching Markov models were proposed to overcome these prob- 
lems. 

In this paper, we consider models which allow the series to switch between 
regimes and more particularly we study the case of mixtures of multilayer per- 
ceptrons. In this frame, rather than using a single global model, we estimate 
several local models from the data. For the moment, we assume that switches 
between different models occur independently, the next step of this approach 
being to also learn how to split the input space and to consider the more general 



case of gated experts or mixtures of experts models (Jacobs et al., 1991). The 
problem we address here is how to select the number of components in a mix- 
ture of multilayer perceptrons. This is typically a problem of non-identifiability 
which leads to a degenerate Fisher information matrix and the classical chi- 
square theory on the convergence of the likelihood ratio fails to apply. One 
possible method to answer this problem is to consider penalized criteria. The 
consistency of the BIG criterion was recently proven for non-identifiable mod- 
els such as mixtures of densities or hidden Markov models (Keribin, 2000 and 
Gassiat, 2002). We extend these results to mixtures of nonlinear autoregressive 
models and prove the consistency of a penalized estimate for the number of 
components under some good regularity conditions. 

The rest of the paper is organized as follows : in Section 2 we give the defi- 
nition of the general model and state sufficient conditions for regularity. Then, 
we introduce the penalized likelihood estimate for the number of components 
and state the result of consistency. Section 3 is concerned with applying the 
main result to mixtures of multilayer perceptrons. Some open questions, as well 
as some possible extensions are discussed in the conclusion. 

2 Penalized-likelihood estimate for the number 
of components in a mixture of nonlinear au- 
toregressive models 

This section is devoted to the setting of the general theoretical frame : model, 
definition and consistency of the penalized-likelihood estimate for the number 
of components. 

The model - definition and regularity conditions 

Throughout the paper, we shall consider that the number of lags is known 
and, for case of writing, we shall set the number of lags equal to one, the 
extension to I time-lags being immediate. 

Let us consider the real-valued time series Yt which verifies the following 
model 

(1) Yt=F^s.^{Yt-i)+ee.,{t), 
where 

• Xt is an iid sequence of random variables valued in a finite space {1, ...,]3o} 
and with probability distribution 7r° ; 

• for every i E {1, ...,po}, Fg, (y) G T and 
T = {Fe, 6* e e, e c R' compact set} 

is the family of possible regression functions. We suppose throughout the 
rest of the paper that _Fg, arc sublinear, that is they are continuous and 
there exist {a\, 6°) e E^^ such that 1^° {y)\ < 0° \y\ + 6°, (V) y G M ; 



• for every i € {l,...,po}, (eOi (0)t is an iid noise such that eg. {t) is inde- 
pendent of {Yt^k)k>i- Moreover, e^^ (t) has a centered Gaussian density 

fO 

The subhnearity condition on the regression functions is quite general and 
the consistency for the number of components holds for various classes of pro- 
cesses, such as mixtures of densities, mixtures of linear autoregressive functions 
or mixtures of multilayer perceptrons. 

Let us also remark that besides its necessity in the proof of the theoreti- 
cal result, the compactness hypothesis for the parameter space is also useful in 
practice. Indeed, one needs to bound the parameter space in order to avoid nu- 
merical problems in the multilayer perceptrons such as hidden-unit saturation. 
In our case, 10^ seems to be an acceptable bound for the computations. On the 
other hand, mixture probabilities are naturally bounded. 

The next example of a linear mixture illustrates the model introduced by 
(1). The hidden process Xt is a sequence of iid variables with Bernoulli(0.5) 
distribution. We define Yt as follows, using Xt and a standard Gaussian noise 
£t ■■ 

r 0.5rt-i +et ,ifXt = l 
* \ -Q.bYt-i+et ,ifXt^O 

The penalized-likelihood estimate which we introduce in the next subsection 
converges in probability to the true number of components of the model un- 
der some regularity conditions on the process Yt. More precisely, we need the 
following hypothesis which implies, according to Yao and Attali (2000), strict 
stationarity and geometric ergodicity for Yt : 

(Hs) Er:i'r°i«?r<i 

Let us remark that hypothesis (HS) does not request every component to 
be stationary and that it allows non-stationary "regimes" as long as they do 
not appear too often. Since multilayer perceptrons are bounded function, this 
hypothesis will be naturally fulfilled. 

Construction of the penalized likelihood criterion 

Let us consider an observed sample {yi, ..., y„} of the time series Yt. Then, 
for every observation yt, the conditional density with respect to the previous 
yt-i and marginally in Xt is 

/ (yt I yt-i) - J2 <^l (y* - Pi (2^*-i)) 
1=1 

As the goal is to estimate poj the number of regimes of the model, let us 
consider all possible conditional densities up to a maximal number of regimes 
P, a fixed positive integer. We shall consider the class of functions 



Gp =[JGp, Gp^ Iglg (yi, 2/2) = Yl ^'^<^^ (2/2 - Fe^ (yi)) \ , 

p=l I i=l J 

where tt^ > 77 > 0, X]r=i '''i ~ 1' -^9; (j/) ^ -^ ^^'^ foi i^ ^ centered Gaussian 
density. 

For every g G Qp we define the number of regimes as 

P{g) ^minipe {1,...,P}, g £ Gp} 

and let po = p (5*') be the true number of regimes. 

We can now define the estimate p as the argument p £ {!,..., P} maximizing 
the penalized criterion 

(2) r„ (p) ^ SUPg^g^ In (g) - an (p) 

where 

n 

ln{g) ^Ylng{yt^i,yt) 

t=2 

is the log-likelihood marginal in Xk and a„ {p) is a penalty term. 

Convergence of the penalized likelihood estimate 

Several statistical and probabilistic notions such as mixing processes, brack- 
eting entropy or Donsker classes will be used hereafter. For parcimony purposes 
we shall not remind them, but the reader may refer to Doukhan (1995) and Van 
der Vaart (2000) for complete monographs on the subject. 

The consistency of p is given by the next result, which in an extension of 
Gassiat (2002): 

Theorem 1 : Consider the model (Y^, Xk) defined by (1) and the penalized- 
likelihood criterion introduced in (2). Let us introduce the next assumptions : 

(Al) Qn (•) is an increasing function of p, a„ (pi) — a„ (^2) -^ 00 when 
n — » 00 for every pi > p2 and S21I2I _> q when n — > 00 for every p 

(A 2) the model (Yfe, Xk) verifies the weak identifiability assumption (HI) 

P Po P Po 

E ^^f^ (2/2 - F. (yi)) = E ^°/° (2/2 - F^ (2/1)) ^ J2 ^^'^«^ ^ E "^'^00 

i— 1 i=l 2—1 2—1 

(A3) the parameterization 9i -^ fi (j/2 — Fi (j/i)) is continuous for every 
(2/172/2) o-nd there exists m{yi,y2) an integrable map with respect to the station- 
ary measure 0/ (Y/j, Y/j-i) such that \log (g)\ < m 

(A4) Yk is strictly stationary and geometrically [3-mixing, and the family 
of generalized score functions associated to Qp 



f-1 



L^(p) 



^ C £2 (m) 



S^\sg, sg {y,,y2) - #Hf^, 9 e Gp, 

where fi is the stationary measure of (Yfe, Yfe_i) and for every e > 

H[.]{e,S,\\-\\^} = 0{\loge\), 

7i[.] (e, 5, ll'llj) being the bracketing entropy of S with respect to the L2- 
norm. 

Then, under hypothesis (A1)-(A4-), (HS) et (HC), p ^ Po in probability. 

Proof of Theorem 1 

First, let us show that p does not overestiraate po- We shaU need the foUow- 
ing hkehhood ratio inequahty which is an immediate generahzation of Gassiat 
(2002) to multivariate dependent data. 

Let Q C Qp be a parametric family of conditional densities containing the 
true model g'^ and consider the generalized score functions 



Sg{yi,V2) 



g{vi,y-2) 
g°(yi:y2) 



1 



- 1 



L2(p) 



where /i is the stationary measure of {Yk^i^Yk). Then, 

supgeg [In {9) -~ln{9 )) < T^supgeg ^^ — --j- -; 

with 

(sg)_ (yfc-1, yfc) = min (0, Sg (jjk-i,yk))- 

Then we have : 

p 

F(p>po)< Y. P(r„(p)>T„(po)) = 



^ P (supg^gjn (.9) - a„ (p) > SUPgfzg^Jri (ff) - «« (Po)) < 



P 

P=Po + l 



1 {Y.L2^9{y^-uYk)r 

^supg^g^ — > a„ (p) - a„ (po) 



Under the hypothesis (HS), there exists a unique strictly stationary solution 
Yfc which is also geometrically ergodic and this implies that Yk is in particular 
geometrically /3-mixing. Then, by remarking that 



/3(Vfc-l,n) ^ on 

we obtain that the bivariate series (Yk-i,Yk) is also strictly stationary and 
geometrically /3-mixing. 

This fact, together with the assumption on the e-bracketing entropy of S 
with respect to the |Hli2(„) norm and the condition that S C £2 (m) ensures 
that Theorem 4 in Doukan, Massart and Rio (1995) holds and 

is uniformly tight and verifies a functional central limit theorem. Then, 

2 

= Op(l) 



\k=2 I 



On the other hand, S <Z C2 (p), thus S^ C Ci (/i) and using the >C2-entropy 
condition S^ ~ < (sg) , g (z Gpf is Glivenko-Cantelli. Since (yfe-ii^fe) is er- 
godic and strictly stationary, we obtain the following uniform convergence in 
probability : 



1 " 
'i'n'fgegp:^^—j^(sgf {Yk-i,Yk) — >„^oo infgeg^ ||(sg)_ 



|2 



fe=2 



To finish the first part, let us prove that 

infgeg^ \\{sg)_\\2 > 

If we suppose, on the contrary, that infg^g^ (■^g) 2 ~ ^^ then there exists 
a sequence of functions (sg„)„>]^ , gn ^ Gp such that ||(sg„) II2 -^ 0. The L2- 
convergence implies that (sg^) ^ in Li and a.s. for a subsequence Sg^ ^. 
Since J Sg^d^i — and Sg^ = (sg„) + (■Sg„)^, where (sg„)_|_ = max (0, Sg^), we 

obtain that J (sg^), d^ ~ — J (sg„)_ d^ ~ J (sg„)_ d^ and thus (sg^), -^ 

in Li and a.s. for a subsequence Sg^^,. The hypothesis (A4) ensures the 
existence of a square-integrable dominating function for S and, finally, we get 
that a subsequence of Sg^ converges to a.s. and in L2, which contradicts the 
fact that / s^dfi = 1 for every g E Gp, so that : 

Efc=2(«ff). {yk-i,Yk) 
Then, by the uniform tightness above and the hypothesis (Al), 



P (P > Po) 'n^oo 

Let us now prove that p does not underestimate po : 

Po-l 

F (p < Po) < ^ F (T„ (p) > T„ (po)) < 



Po-l 



fsupg^g^ {In (g) - In {g°)) a„ (p) - a„ (po) 



^ y^ p ■^"-/^ge^p V^» \yj - •.n \y jj ^ 

P=i \ / 

Now, l„ (g) — In (.g°) ~ X]fe=2 ^" ( o/y'^' y ) ) and under the hypothesis 

(A3), the class of functions < In-^, <? G 5p [ is P-Ghvenko-CanteUi (the general 

proof for a parametric family can be found in Van der Vaart, 2000) and since 
(Ifc-ijlfe) is ergodic and strictly stationary, we obtain the following uniform 
convergence in probability : 

——supg^g^ {In (.9) - In (s")) — > supg^g^ / ln—g°d^i 

n I J g 

Since p < Po and using assumption (A2), the limit is negative. By hypoth- 
esis (Al), °" „"l°i" converges to when n -^ 00, so we finally have that 
F (p < Po) — > and the proof is done. 



3 Mixtures of multilayer perceptrons 

In this section, we consider the model defined in (1) such that, for every i G 
{1, ...,po}, -Ff is a multilayer perceptron. Since non-identifiability problems also 
arise in multilayer perceptrons (see, for instance, Rynkiewicz, 2006), we shall 
simplify the problem by considering one hidden layer and a fixed number of 
units on every layer, k. Then, we have that for every i G {1, ...,po} 

where (j) is the hyperbohc tangent and 

^i = ("O >«! :--,"fe >/3o,l'/5l,l,--,/3o,fc'/3i;fe.cr ' j 

is the true parameter with the true variance. Let us check if the hypothe- 
sis of the main result of section 2 apply in the case of mixtures of multilayer 
perceptrons. 



Hypothesis (HS) : The stationarity and ergodicity assumption (HS) is im- 
mediately verified since the output of every perceptron is bounded, by construc- 
tion. Thus, every regime is stationary and the global model is also stationary. 

Let us consider the class of all possible conditional densities up to a maximum 
number of components P > : 

Gp = Up=i Gp, Gp = {9\9 (2/1,2/2) == Er=i '^^f^ (2/2 - P^ (2/i))}, where 

• X]r=i "'4 = 1 H-iid wc may suppose quite naturally that for every i S 
{l,...,p}, TTj > 77 > 

• for every i G {1, ...,p}, P^ is a multilayer perceptron 

F^ (2/) - a^ + E,t 1 aj0 {Phj + Pl,y) , where 



0,.= 



al,a\, ...,al,(3Qi,(3l-^, ...,/3q j,, /3J ,j,, ctM belongs to a compact set. 



Hypothesis (Al) 



(•) may be chosen, for instance, equal to the BIC 



penalizing term, a„ (p) = ip log (r 

Hypothesis (A2)-(A3) : Since the noise is normally distributed, the weak 
identifiability hypothesis is verified according to the result of Teicher (1963), 
while assumption (A3) is a regularity condition verified by Gaussian densities. 

Hypothesis (A4) : We consider the class of generalized score functions 



5 = 



f: 

'9' ■^9 - ||9_i| 



-, 5 e Gp, 



1 



lHp) 



^0 



The difficult part will be to show that Hi] {e,S, ||-||2) ~ O (|/og£|) for all 
e > which, since we are on a functional space, is equivalent to prove that "the 
dimension" of S can be controlled. For g G Qp, let us denote = {9i,...,9p) 
and TT = (tti, ...,7rp), so that the global parameter will be $ — (^j'"') and the 
associated generalized score function s$ := Sg. 

Proving that a parametric family like S verifies the condition on the brack- 
eting entropy is usually immediate under good regularity conditions (see, for 
instance. Van der Vaart, 2000). A sufhcient condition is that the bracketing 
number grows as a polynomial function oi -. In this particular case, the prob- 
lems arise when g ^ f and the limits in L^ (^) of Sg have to be computed. Let 
us split S into two classes of functions. We shall consider Tq C Gp s. neighbor- 

< e > and So = {sg, g G J^o}- 



hood of / such that To = s 9 ^ Gp, 
On S \So, it can be easily seen that 



f-1 



i"(M) 



£L_i 
_^£ 



92 _i 

_f 



1l.2(j,) 






Hence, on S\So, it is sufhcient that 



< 



L2(m) 



L2(^) 



9]_ _92_ 
f f 



< 



for 





£i 


- 1 




32 


- 1 






/ 


/ 


II 




91 _ 
/ 


- 1 


2 


92 _ 


-1 


2 



< e. 



Now, iS \ iSo is a parametric class. Since the derivatives of the transfer 
functions are bounded, according to the example 19.7 of Van der Vaart (2000), 
it exists a constant K so that the bracketing number of S^ is lower than 



K 



/ dianiGj; 



3(fc+l)P 



K 



I -y/diamGp 



6(fc+l)P 






where diamGp is the diameter of the smallest sphere of K'^('^+^)^ including the 

set of possible parameters. So, we get that A/]] (e,5 \ 5o, ||-||2) = C (-) , 

where A/li (e, S \ Sq, H'IIj) is the number of e-brackets necessary to cover S\Sq 
and the bracketing entropy is computed as logAf[] (e, S \ Sq, ||'||2)- 

As for So, the idea is to reparameterizc the model in a convenient manner 
which will allow a Taylor expansion around the identifiable part of the true 
value. For that, we shall use a slight modification of the method proposed by 
Liu and Shao (2003). Let us remark that when 4 — 1 = 0, the weak idcntifiability 
hypothesis (A2) and the fact that for every i e {1, ...,p}, tt^ > rj > 0, imply that 
there exists a vector t ~ (ti)o<i<p such that = to < ^i < ••• < tpo — P ^-^d, 
modulo a permutation, $ can be rewritten as follows : 9t-_-^+i ~ ... = 6t^ ~ 9^, 
^ '^(._ 1]^ TTj = TT^, i E {1,...,pq}. With this remark, one can define in the 



general case s 
j e {t,_i + 1, 



Si 






L+1 



^j - Trr, qj 



^i=ti-i+i ' 



and the new parameterization will be Oj = (</)(, t/ij), 

^t = ((^j)i<j<p'(s^)i<*<po-i)' ^* " (9i)i<j<p' with (j)t containing aU the 
identifiable parameters of the model and ipt the non-identifiable ones. Then, for 
g = f, we will have that 



.,e? 



9" 

Po' 



■,0L o,...,o)^ 



This reparameterization allows to write a second-order Taylor expansion of 
4 — 1 at (/)( . For ease of writing, we shall first denote 



9j{yi,y2) ^99, iyi,y2) 

Then, the density ratio becomes : 



f,{y2-F,{vi)) 

E2i</°(y2-F0(yi)) 



f - 1 == Er=i {^^ + <) E}L*._,+i ?,.9, + ( < - lit, sA E74„_,+i 9.5, 



^po — 1 



By remarking that when (fit — (f)^, j docs not vary with ipt, we will study 
the variation of this ratio in a neighborhood of 0" and for fixed ipt- Let us note 
g|^ the vector of derivatives of gj with respect of each components of 9j and 

-^^ the vector of second derivatives of gj with respect of each components of 

Oj. Assuming that (.9j)i<j-<p, (ffj)i<^<p and (.9")i<j<p, where 

are linearly independent in L^ (/i), one can prove the following : 



Proposition 1 : Let us denote D {(fit, ipt) 



f 



fixed Ipt, there exists the second-order Taylor expansion at (j>t ■ 



LH^i) 



For any 



with 



Pa 



U=ti-i+l 




and 



Pa 



i*-^?) 3W.^.)(^*-^")=E 




+< E 9,(^,-e°)%f(^.-^" 



Moreover, 



Proof of Proposition 1 

The first term in the developpement can be computed easily by remarking 
that the gradient of -%■ — 1 at ((p'},tpt) is : 



for i e {l,...,po} and j e {ti_i + l,...,ti}, 



(0O,^,)=^Og,,9: 



10 



for « G {l,...,po- 1}, 



ds. 



i't) = E/=t._i+i <lj9e<^ - E/=t,„_i+i <lj98<^^ = 5e« - 9eo^ 



The term of second order can be obtained by direct computations once the 
hessian in computed at (0°,V't): 



a^Mr-i 



a^Mfr-i 



06,301 



d 



2 g 



dsidsk 



dsidOi 



^"iV"*) ^A^ijd'i , « = l,---,Po and j = tj_i + l,...,ti 

i't^'^t) ^0 , j,l^ l,...,pand j 7^/ 

^°,^t) = , i,A:= l,...,po- 1 

^°i'0t) ^'ZiSi , « = l,---,Po- 1 and j = ti_i + l,...,ti 



dsidOj 



'V"*) = -'J'jSpo , * = l,---,Po - 1 and j = tp^-i + l,...,tp 



• the other crossed derivatives of Si and 6j are zero 

It remains to prove that the rest is o ( i/ii — ^4 ) but this foUows directly from 
Yao (2000) and the fact that, since the noise is normaUy distributed, Yt has 
moments of any order. 



Using the Taylor expansion above, for 9 belonging to §0, f — 1 is the sum 
of a linear combination of 

V{z) := i^gi,--- ,9p,9i,--- ,9p,9i,--- ,9p) 

and of a term whose L^ norm is negligible compared to the L^ norm of this 
combination when e goes to 0. By assumption (A3), a strictly positive number 
m exists so that for any vector of norm 1 with components 



C — (ci , • • • , CpQ X (3fe+l) jdi,- ■ ■ ,dp^x (3fe+l) 761, 

and e sufficiently small: 



2pox(3fe+l); 



\\C^V{z)\\2>m + e. 



Since any function 



can be written: 



C^Viz) + o{\\C^V{z)h) 
\C^V{z) + oi\\C^V{z)h)\\ 



11 



§0 belongs to the set of functions: 

D'^V{z) + o(l), ||Z?||2 < 1) c {d'^V{z) +^^\\Dh<-, |7| < 1 

whose bracketing number is smaller or equal to O (-) ° 
and the assumptions of Theorem 1 are verified I 



4 Some numerical examples 

The theoretical result proven above may be applied in practice to compute the 
number of components in a mixture model on simulated or real data. Some 
examples are presented below, illustrating the stability and the speed of conver- 
gence of the algorithm. 

4.1 Mixtures of linear models 

Let us first consider the particular case of linear models, corresponding to zero 
hidden- unit perceptrons. The examples are mixtures of two autoregressive mod- 
els in which we vary the leading coefficients and the weights of the discrete 
mixing distribution. For every example, we simulate 20 samples of lengths 
n = 200, 500, 1000, 1500, 2000 and we fix P == 3 the upper bound for the number 
of regimes. 

The likelihood is maximized via the EM algorithm (see, for instance, Demp- 
ster, Laird and Rubin (1977) or Redner and Walker, 1984). This algorithm is 
well suited to find a sequence of parameters which increases the likelihood at 
each step, and so converges to a local maximum for a very wide class of models 
and for our model in particular. The idea of the EM algorithm is to replace the 
latent variables of the mixture by their conditional expectation. A brief recall 
on the main steps of the algorithm is given below : 

1. Let X be the vector containing the component of the mixture and con- 
sidered as a latent variable and let y = (l/ij ■ • • , J/n) be the vector of 
observations. 



2. Initialization : Set k ~ and choose 



'0 



3. E-Step : Set 9* = 9k and compute Q{.,d*) with 

Q{e,9*) = Eg, ln(^^^f^] where Lg{y,X) is the likelihood of the 
observations and the vector of mixture X for the parameter 9. This step 
computes the probabilities of the mixture conditionally to the observations 
and with respect to the parameter 9* 

4. M-Step : Find : 

^ = argmaxQ(0,r) 
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5. Replace 9k+i by , and go back to step 3) until a stopping criterion is 
satisfied (i.e. when the parameters don't seem to change anymore). 

The sequence (9k) gives nondecreasing values of the likelihood function up to a 
local maximum. Q{6,6*) is called conditional pseudo-log-likclihood. 



To avoid local maxima, the procedure is initialized several times with dif- 
ferent starting values : in our case, ten different initializations provided good 
results. The stopping criteria applies when either there is no improvement in 
the likelihood value, either a maximum number of iterations, fixed at 200 here 
for reasonable computation time, is reached. 

The true conditional density is 

5° {yi,y2) - <./? (2/2 - F? (y,)) + (1 - TT?) /o (y2 - F^ (y,)) 

with F^ (yi) = flOyi + 6° and / « - AA (o, {cr^f^ for i e {1, 2}. 

For every example, we pick equal standard errors a^ — (J2 — 0.5, fej = 0.5 
and 62 = —0.5 and let vary the rest of the coefficients: tt^ £ {0.5,0.7,0.9}, 
a1,a^G {0.1,0.5,0.9}. 

Let us focus in one particular example. Figure 1 illustrates one out of the 
twenty samples in the case n — 1000, ttJ = 0.7, aj = 0.1 and 0° ~ 0.5. The 
observed values of the series Yj are plotted on the first graph. On the second 
graph, we represent the convergence of the estimates for the mixture probabil- 
ities (solid and dashed lines) to the true values (dotted lines). Only the best 
result from the different initializations of the EM algorithm was drawn. 

The summary of the results for all the examples is given by Table 1. In 
almost every case, the convergence is reached for the samples containing 2000 
inputs. In practice, the results will be then more or less accurate, depending 
on the size of the sample, but also on the proximity of the components and on 
their frequency. 
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Number of iterations 



Figure 1: Series Yt and estimates for mixture probabilities tv^ 
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Table 1: Number of components for b^ = 0.5 and 62 = —0.5 



4.2 Laser time series 

A second example studies the complete laser series of "Santa Fe time series 
prediction and analysis competition" . The level of noise in this series is very 
low, the main source of noise being the errors of measurement. We use the 
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12500 patterns for estimation. The Figure 2 shows the last 1000 patterns. The 
series is supposed to be stationary, and recall from Section 2 that a piecewise 
stationary time series is globally stationary if every component is stationary 
itself. The mixture of expert models is an example of piecewise stationary and 
globally stationary time series. 




Figure 2: 1000 last Patterns of the laser series 

We want to choose the number of components of the mixture by minimizing 
the BIC criteria. Based on previous study (Rynkiewicz, 1999) we choose to 
use experts with 10 entries, 5 hidden units, one linear output, and hyperbolic 
tangent activation functions. We want to know the number of experts to use 
with such series. As this is a real application, it is impossible to check the 
main assumption of our theory : the true model belongs to the set of possible 
models. However, we want to know if the developed theory can give an insight 
for choosing the number of experts. 

The parameters are estimated using the standard EM algorithm. In order 
to avoid bad local maxima, estimation is performed with 100 different initializa- 
tions of model parameters. For each estimation we proceed with 200 iterations 
of the EM algorithm and for each M-step we optimize the parameter of the 
MLPs until their error of prediction doesn't improve anymore. 

The estimated loglikelihood and the estimated probability of the mixture 
are the following : 



number of experts 


BIC 


probabilities of the mixture 


1 
2 
3 


-32.16894 
-25.92 
-38.42 


1 

(0.8,0.2) 

(0.7,0.21,0.09) 



The results are clear for our model, the best model is the model with two 
experts. It is difficult to give an interpretation of the regimes because mixing 
probabilities remain constant over time. However, if we look at the prediction 
made by each expert, we can see that one expert seems to be specialized in the 
general regime of the series and the second one with the collapse regime. 

The proposed method gives an insight on the way to choose the number of 
experts in a mixture model for laser time series. However, since the probabilities 
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of the mixture are constant, it would be better to choose probabiHties depending 
on the previous value of the time series as in the gating expert of Weigend et al. 
(Weigend et. al, 1995) or of the time as in hybrid hidden Markov/MLP Models 
(Rynkiewicz, 2006). The prediction error of this simple mixture model is not 
competitive with such more complex models, however we need to improve the 
theory to deal with such complex modeling. 

5 Conclusion and future work 

We have proven the consistency of the BIC criterion for estimating the number 
of components in a mixture of multilayer perceptrons. In our opinion, two 
important directions are to be studied in the future. The case of mixtures should 
be extended to the general case of gated experts which allow the probability 
distribution of the multilayer perceptrons to depend on the input and thus, 
to learn how to split the input space. The second possible extension should 
remove the hypothesis of a fixed number of units on the hidden layer. The 
problem of estimating the number of hidden units in one multilayer perceptron 
was solved in Rynkiewicz (2006), but it would be interesting to mix the two 
results and prove the consistency of a penalized criterion when there is a double 
non-idcntifiability problem : number of experts and number of hidden units. 
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